Occluded Person Re-Identification via Relational Adaptive Feature Correction Learning
Occluded person re-identification (Re-ID) in images captured by multiple
cameras is challenging because the target person is occluded by pedestrians or
objects, especially in crowded scenes. In addition to the processes performed
during holistic person Re-ID, occluded person Re-ID involves the removal of
obstacles and the detection of partially visible body parts. Most existing methods rely on off-the-shelf pose or parsing networks to generate pseudo labels, which are prone to error. To address these issues, we propose a novel Occlusion Correction Network (OCNet) that corrects features through relational-weight learning and obtains diverse and representative features without using external networks. In addition, we present the simple concept of a center feature to provide an intuitive solution to pedestrian-occlusion scenarios. Furthermore, we propose a Separation Loss (SL) that encourages global features and part features to focus on different parts. We conduct extensive experiments on five challenging benchmark datasets for occluded and holistic Re-ID tasks, demonstrating that our method outperforms state-of-the-art methods, especially on occluded scenes.
Comment: ICASSP 202
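The abstract does not spell out the Separation Loss, so the following is a minimal PyTorch sketch under one plausible reading: SL penalizes cosine similarity between the global feature and each part feature so that they attend to different regions. The function name `separation_loss` and all shapes are assumptions for illustration, not OCNet's published formulation.

```python
import torch
import torch.nn.functional as F

def separation_loss(global_feat, part_feats):
    """Hypothetical Separation Loss (SL) sketch: penalize cosine
    similarity between the global feature and each part feature so
    that parts focus on regions the global feature does not cover.

    global_feat: (B, C) global embedding per image
    part_feats:  (B, P, C) P part embeddings per image
    """
    g = F.normalize(global_feat, dim=-1).unsqueeze(1)  # (B, 1, C)
    p = F.normalize(part_feats, dim=-1)                # (B, P, C)
    sim = (g * p).sum(dim=-1)                          # (B, P) cosine similarities
    # Loss grows when part features mimic the global feature;
    # already-separated (negative-similarity) parts are not rewarded further.
    return sim.clamp(min=0).mean()
```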
Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation
Unsupervised video object segmentation (VOS) is a task that aims to detect
the most salient object in a video without external guidance about the object.
To leverage the property that salient objects usually have distinctive
movements compared to the background, recent methods collaboratively use motion
cues extracted from optical flow maps with appearance cues extracted from RGB
images. However, as optical flow maps are often closely correlated with segmentation masks, the network easily becomes overly dependent on motion cues during training. As a result, such two-stream approaches are vulnerable to confusing motion cues, which makes their predictions unstable. To relieve this
issue, we design a novel motion-as-option network by treating motion cues as
optional. During network training, RGB images are randomly provided to the
motion encoder instead of optical flow maps, to implicitly reduce motion
dependency of the network. As the learned motion encoder can deal with both RGB
images and optical flow maps, two different predictions can be generated
depending on which source information is used as motion input. In order to
fully exploit this property, we also propose an adaptive output selection
algorithm that selects the optimal prediction at test time. Our proposed approach achieves state-of-the-art performance on all public benchmark datasets while maintaining real-time inference speed.
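The training trick and the test-time selection are concrete enough to sketch. Below is a minimal PyTorch illustration assuming a generic two-stream encoder-decoder; the module names, the mixing probability `p_rgb`, and the confidence-based selection rule are assumptions, not the authors' exact implementation.

```python
import random
import torch

def motion_as_option_forward(appearance_enc, motion_enc, decoder,
                             rgb, flow, p_rgb=0.5, training=True):
    # During training, the motion encoder randomly receives the RGB image
    # instead of the optical flow map, so the network cannot become
    # overly dependent on motion cues.
    motion_input = rgb if (training and random.random() < p_rgb) else flow
    return decoder(appearance_enc(rgb), motion_enc(motion_input))

@torch.no_grad()
def select_output(pred_with_flow, pred_with_rgb):
    # Hypothetical adaptive output selection: keep the mask whose
    # foreground probabilities are more decisive (closer to 0 or 1).
    conf_flow = (pred_with_flow - 0.5).abs().mean()
    conf_rgb = (pred_with_rgb - 0.5).abs().mean()
    return pred_with_flow if conf_flow >= conf_rgb else pred_with_rgb
```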
N-RPN: Hard Example Learning for Region Proposal Networks
The region proposal task is to generate a set of candidate regions that
contain an object. In this task, it is most important to cover as many ground-truth objects as possible within a fixed number of proposals. In a
typical image, however, there are too few hard negative examples compared to
the vast number of easy negatives, so region proposal networks struggle to
train on hard negatives. Because of this problem, networks tend to propose hard
negatives as candidates, while failing to propose ground-truth candidates,
which leads to poor performance. In this paper, we propose a Negative Region
Proposal Network (nRPN) to improve the Region Proposal Network (RPN). The nRPN learns from the RPN's false positives and provides hard negative examples to the RPN. Our proposed nRPN leads to a reduction in false positives and better RPN performance. An RPN trained with an nRPN achieves performance improvements on the PASCAL VOC 2007 dataset.
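A rough PyTorch sketch of the example exchange the abstract describes follows. Here `iou_fn` is an assumed helper returning each box's maximum IoU with any ground-truth box, and both thresholds are illustrative values, not the paper's.

```python
import torch

def exchange_hard_examples(rpn_scores, nrpn_scores, boxes, gt_boxes,
                           iou_fn, score_thresh=0.7, iou_thresh=0.5):
    """Select training examples exchanged between the RPN and the nRPN.

    rpn_scores, nrpn_scores: (N,) objectness / negativeness scores
    boxes: (N, 4) anchors or proposals shared by both networks
    Returns boolean masks over the N boxes.
    """
    max_iou = iou_fn(boxes, gt_boxes)  # (N,) max IoU with any ground truth
    # RPN false positives: confidently scored as objects, yet far from
    # every ground-truth box. These become positive examples for the nRPN.
    nrpn_positives = (rpn_scores > score_thresh) & (max_iou < iou_thresh)
    # Boxes the nRPN confidently flags are fed back to the RPN as hard
    # negatives, replacing randomly sampled easy negatives.
    rpn_hard_negatives = (nrpn_scores > score_thresh) & (max_iou < iou_thresh)
    return nrpn_positives, rpn_hard_negatives
```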
NIR-to-VIS Face Recognition via Embedding Relations and Coordinates of the Pairwise Features
NIR-to-VIS face recognition is the task of identifying faces across two different domains by extracting domain-invariant features. It is challenging due to the differing characteristics of the two domains and the lack of NIR face datasets. To reduce the domain discrepancy while reusing existing face recognition models, we propose a 'Relation Module' that can simply be added on to any face recognition model. The local features extracted from a face image contain information about each component of the face. Given the two domains' differing characteristics, using the relationships between local features is more domain-invariant than using the features as they are. In addition to these relationships, positional information, such as the distance from lips to chin or from eye to eye, also provides domain-invariant cues. In our Relation Module, a Relation Layer implicitly captures these relationships, and a Coordinates Layer models the positional information. Moreover, our proposed triplet loss with a conditional margin reduces intra-class variation during training, yielding additional performance improvements. Unlike general face recognition models, our add-on module does not need to be pre-trained on a large-scale dataset; it is fine-tuned only on the CASIA NIR-VIS 2.0 database. With the proposed module, we achieve improvements of 14.81% in rank-1 accuracy and 15.47% in verification rate at 0.1% FAR compared to two baseline models.
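To make the Relation Layer / Coordinates Layer split concrete, here is a compact PyTorch sketch assuming the backbone yields an HxW grid of local features; the pairwise MLP and the normalized-coordinate embedding are standard relational-reasoning choices that only stand in for the paper's actual layers.

```python
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """Illustrative add-on module, not the paper's exact architecture."""

    def __init__(self, c_in, c_out=256):
        super().__init__()
        # Relation Layer: scores every ordered pair of local features.
        self.relation = nn.Sequential(
            nn.Linear(2 * c_in, c_out), nn.ReLU(),
            nn.Linear(c_out, c_out))
        # Coordinates Layer: embeds each location's normalized (x, y).
        self.coords = nn.Linear(2, c_out)

    def forward(self, fmap):  # fmap: (B, C, H, W) backbone local features
        b, c, h, w = fmap.shape
        n = h * w
        feats = fmap.flatten(2).transpose(1, 2)       # (B, N, C)
        # Build all ordered pairs (i, j) of local features.
        fi = feats.unsqueeze(2).expand(b, n, n, c)
        fj = feats.unsqueeze(1).expand(b, n, n, c)
        rel = self.relation(torch.cat([fi, fj], dim=-1)).mean(dim=(1, 2))
        # Normalized grid coordinates carry cues such as eye-to-eye distance.
        ys, xs = torch.meshgrid(torch.linspace(0, 1, h),
                                torch.linspace(0, 1, w), indexing="ij")
        xy = torch.stack([xs, ys], dim=-1).reshape(1, n, 2).to(fmap)
        pos = self.coords(xy).mean(dim=1).expand(b, -1)
        return rel + pos                               # (B, c_out) embedding
```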
Pixel-Level Equalized Matching for Video Object Segmentation
Feature similarity matching, which transfers the information of the reference
frame to the query frame, is a key component in semi-supervised video object
segmentation. If surjective matching is adopted, background distractors are easily introduced and degrade performance. Bijective matching mechanisms try to prevent this by restricting the amount of information transferred to the query frame, but they have two limitations: 1) surjective matching cannot be fully leveraged, as it is transformed into bijective matching at test time; and 2) manual tuning is required at test time to search for the optimal hyperparameters. To overcome these limitations while ensuring reliable information transfer, we introduce an equalized matching mechanism. To prevent the reference frame information from being overly referenced, its potential contribution to the query frame is equalized by simply applying a softmax operation along the query dimension as well. On public benchmark datasets, our proposed approach achieves performance comparable to state-of-the-art methods.
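Below is a minimal PyTorch sketch of an equalized matching step, under the reading that a softmax is applied along the query dimension so each reference pixel's total contribution is equalized before labels are transferred; the tensor shapes, the temperature, and the final normalization are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def equalized_matching(query_feat, ref_feat, ref_mask, temperature=0.1):
    """Transfer reference labels to the query frame with equalized weights.

    query_feat: (B, C, Nq) query-frame pixel embeddings
    ref_feat:   (B, C, Nr) reference-frame pixel embeddings
    ref_mask:   (B, Nr)    reference-frame labels to transfer
    """
    q = F.normalize(query_feat, dim=1)
    r = F.normalize(ref_feat, dim=1)
    sim = torch.bmm(r.transpose(1, 2), q) / temperature   # (B, Nr, Nq)
    # Softmax along the query dimension equalizes each reference pixel's
    # total contribution, so no single reference pixel can dominate.
    sim = F.softmax(sim, dim=2)
    # Normalize over reference pixels to obtain per-query transfer weights.
    attn = sim / sim.sum(dim=1, keepdim=True).clamp(min=1e-6)
    return torch.bmm(ref_mask.unsqueeze(1), attn).squeeze(1)  # (B, Nq)
```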